In the following questions, describe your procedures and write the answers explicitly (there should be no need to run code to obtain them).
Include the code.
Discussing the exam outside your team is not allowed.
Questions (which must not include information about the answer) should be asked in the group's Canvas.
Keep the visualization material in mind when making your plots and reporting your results.
1.1 According to a survey in the US, 26% of adult residents of Illinois have completed high school. A researcher suspects that this percentage is lower in a particular county of the state. He draws a random sample from that county and finds that 69 of the 310 people in the sample have completed high school. Do these results support his hypothesis? (Describe your choice of hypothesis test, the p-value, and your conclusion.)
# params
PROP_HIGH_SCHOOL <- 0.26 # claim
num_high_school <- 69 # observed number of high school graduates
num_sample <- 310 # sample size
prop_investigator <- num_high_school / num_sample # sample proportion
# test statistic
# H0: p = 0.26
# H1: p < 0.26
se_prop <- sqrt(prop_investigator * (1 - prop_investigator) / num_sample)
z_stat <- (prop_investigator - PROP_HIGH_SCHOOL) / se_prop # wald test
# p-value
pval <- pnorm(z_stat) # left tail test
cat(str_glue("p-value: {round(pval, 4)}"))
## p-value: 0.0566
Conclusion: The sample from this Illinois county, in which 22.3% (69 of 310) of respondents have completed high school, does not provide sufficient statistical evidence to support the researcher's hypothesis that the proportion of high-school graduates in the county is below the statewide 26%. The p-value of 0.0566 is slightly above the standard significance level of 0.05, so the null hypothesis cannot be rejected. However, since the p-value sits very close to the threshold, there may be a tendency toward a lower percentage in the county, even though it does not reach conventional statistical significance.
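As a cross-check that is not part of the original solution, the exact binomial test avoids the normal approximation entirely; `binom.test` is base R and the counts come from the problem statement:

```r
# Exact binomial test for H0: p = 0.26 vs H1: p < 0.26.
# Cross-check of the Wald/normal approximation above; base R only.
exact_test <- binom.test(x = 69, n = 310, p = 0.26, alternative = "less")
# The left-tail p-value equals P(X <= 69) under Binomial(310, 0.26)
exact_test$p.value
```

In this case the exact p-value also exceeds 0.05, so the conclusion (fail to reject \(H_0\)) is unchanged.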
1.2 Mendel bred peas from smooth yellow seeds and from wrinkled green seeds. These produced four types of offspring: smooth yellow, wrinkled yellow, smooth green, and wrinkled green. The count of each type is multinomial with parameter \(p=(p_1, p_2, p_3, p_4)\). According to his theory of inheritance this probability vector is: \[p=(9/16,3/16,3/16,1/16)\] Over \(n=556\) experiments he observed \(x=(315,101,108,32)\). Use the likelihood ratio test to test \(H_0:p=p_0\) against \(H_1:p\ne p_0\).
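For reference, the statistic computed in the code below is the standard likelihood ratio statistic; under \(H_0\) it is asymptotically \(\chi^2\) with \(4-1=3\) degrees of freedom:
\[\lambda = 2\log\frac{\sup_{p} L(p \mid x)}{L(p_0 \mid x)} = 2\sum_{j=1}^{4} x_j \log\left(\frac{x_j/n}{p_{0j}}\right) \;\overset{d}{\to}\; \chi^2_{3}.\]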
# params
P_PEAS <- c(9/16, 3/16, 3/16, 1/16) # claim
observed_peas <- c(315, 101, 108, 32) # observed number of peas
# hypothesis test #
# H0: p = P_PEAS
# H1: p != P_PEAS
# null hyp
ll_null <- dmultinom(x = observed_peas, prob = P_PEAS, log = TRUE)
# alt hyp
p_maxlike <- observed_peas / sum(observed_peas)
ll_alt <- dmultinom(x = observed_peas, prob = p_maxlike, log = TRUE)
# likelihood ratio test statistic
chi_stat <- 2 * (ll_alt - ll_null)
# p-value
pval <- pchisq(chi_stat, df = 3, lower.tail = FALSE)
# print
cat(str_glue("p-value: {round(pval, 4)}"))
## p-value: 0.9243
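As an additional check not in the original, Pearson's chi-square goodness-of-fit statistic is asymptotically equivalent to the likelihood ratio statistic and is available directly via base R's `chisq.test`:

```r
# Pearson goodness-of-fit test of the Mendelian ratios (base R).
observed_peas <- c(315, 101, 108, 32)
P_PEAS <- c(9/16, 3/16, 3/16, 1/16)
pearson <- chisq.test(observed_peas, p = P_PEAS)
pearson$statistic # X^2 ≈ 0.470
pearson$p.value   # ≈ 0.925, in line with the LRT p-value above
```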
# With simulation
p_0 <- c(9/16, 3/16, 3/16, 1/16)
x <- c(315, 101, 108, 32)
n <- sum(x)
# Log-likelihood under H0 using dmultinom
ll_H0 <- dmultinom(x, size = n, prob = p_0, log = TRUE)
# MLE of the probabilities under H1
p_hat <- x / n
# Log-likelihood under H1 using dmultinom with the estimated probabilities
ll_H1 <- dmultinom(x, size = n, prob = p_hat, log = TRUE)
# Compute the lambda statistic
lambda <- 2 * (ll_H1 - ll_H0)
### Simulation under H0:
lambda_multinomial <- function(x, p_0) {
p_mv <- x / sum(x)
log_p_mv <- sum(dmultinom(x, prob = p_mv, log = TRUE))
log_p_nula <- sum(dmultinom(x, prob = p_0, log = TRUE))
lambda <- 2 * (log_p_mv - log_p_nula)
return(lambda)
}
# Observed value
x_obs <- c(315, 101, 108, 32)
lambda_obs <- lambda_multinomial(x_obs, p_0)
# Simulation under the null hypothesis
num_simulaciones <- 10000
# draw num_simulaciones multinomial samples; one simulation per row
simulados_nula <- t(rmultinom(num_simulaciones, size = n, prob = p_0))
# Compute lambda for each simulation
lambda_vals <- apply(simulados_nula, 1, function(sim_data) {
lambda_multinomial(as.integer(sim_data), p_0)
})
# Build a data frame for the plot
sims_tbl <- tibble(lambda = lambda_vals)
# Plot the null distribution of lambda with the observed value
g1 <- ggplot(sims_tbl, aes(x = lambda)) +
geom_histogram(binwidth = 0.7) +
geom_vline(xintercept = lambda_obs, color = "red")
g1
# Compute the p-value
p_valor <- mean(sims_tbl$lambda >= lambda_obs)
p_valor
## [1] 0.9233
Conclusion: Both the asymptotic calculation (p-value 0.9243) and the simulation (0.9233) give a p-value of roughly 0.92. Such a large p-value indicates there is no statistical evidence to reject the null hypothesis. We therefore conclude that the proportions observed in Mendel's experiments do not differ significantly from the theoretical proportions he proposed, which, for these data, is consistent with his theory of genetic inheritance.
1.3. Let \(X_1, ...X_n \sim Poisson(\lambda)\).
Let \(\lambda_0>0\). What is the Wald test for \(H_0: \lambda = \lambda_0, H_1: \lambda \neq \lambda_0\)?
With \(\lambda_0=1\), \(n=20\) and \(\alpha = 0.05\): simulate \(X_1, ...X_n \sim Poisson(\lambda_0)\) and run the Wald test; repeat 1000 times and record the percentage of times you reject \(H_0\). How close is your type 1 error to \(0.05\)?
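The Wald test implemented below follows from the standard asymptotics: the MLE is \(\hat\lambda = \bar{X}\) with estimated standard error \(\sqrt{\hat\lambda/n}\), so
\[Z = \frac{\hat\lambda - \lambda_0}{\sqrt{\hat\lambda / n}} \approx N(0,1) \text{ under } H_0, \qquad \text{reject } H_0 \text{ at level } \alpha \text{ if } |Z| > z_{1-\alpha/2}.\]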
# params
LAMBDA0 <- 1
N_SAMPLE <- 20
ALPHA <- 0.05
N_BOOT <- 1000
SEED <- 8
# function to simulate data and use test
wald_test <- function(x, lambda){
# test statistic
lambda_estimate <- mean(x)
se_lambda <- sqrt(lambda_estimate / length(x))
z_stat <- (lambda_estimate - lambda) / se_lambda
# p-value
pval <- 2 * pnorm(-abs(z_stat))
return(pval)
}
simulate_wald <- function(lambda0, n_sample, n_boot, seed){
# fix the seed for reproducibility
set.seed(seed)
# Monte Carlo: n_boot simulated samples, each tested with the Wald test
pval_boot <- replicate(
n=n_boot,
expr=wald_test(
x=rpois(n_sample, lambda0),
lambda=lambda0
)
)
# return
return(pval_boot)
}
# generate data
pval_wald_poiss <- simulate_wald(
lambda0=LAMBDA0,
n_sample=N_SAMPLE,
n_boot=N_BOOT,
seed=SEED
)
# percentage of times we reject the null hypothesis
cat(str_glue("rejected times: {100 * mean(pval_wald_poiss < ALPHA)}%"))
## rejected times: 4.9%
Conclusion: For the Wald test of H0: lambda = lambda0 vs H1: lambda != lambda0, with lambda0 = 1, a sample size of 20 and a significance level of 0.05, the simulation drew Poisson samples and ran the test 1000 times.
In these simulations the null hypothesis was rejected in approximately 4.9% of cases. This is very close to the 5% significance level, indicating that the Wald test behaves as expected under these conditions. A rejection rate near 5% means the type 1 error, i.e. the probability of incorrectly rejecting the null hypothesis when it is true, is well controlled and matches the nominal level. The simulation therefore supports the accuracy and reliability of the Wald test in this specific setting.
Consider the case where we have a single observation \(x\) from a normal distribution
\[x \sim N(\theta, 1)\]
Suppose now that we choose a Normal prior
\[\theta \sim N(0, \tau)\]
which leads to the posterior distribution (as we saw in the homework)
\[\theta|x \sim N\bigg(\frac{x}{1 + 1/\tau}, \frac{1}{1+1/\tau}\bigg)\]
Now, the larger \(\tau\), the more the posterior concentrates around the maximum likelihood estimator \(\hat{\theta}=x\). In the limit \(\tau \to \infty\) we obtain a non-informative (flat) prior and the posterior
\[\theta|x \sim N(x,1)\]
This posterior coincides with the parametric bootstrap distribution in which we generate values \(x^*\) from \(N(x,1)\), where \(x\) is the maximum likelihood estimator.
The above holds because we used a Normal example, but it also holds approximately in other cases, which leads to a correspondence between the parametric bootstrap and Bayesian inference. In this setting, the bootstrap distribution represents (approximately) a non-informative posterior distribution of the parameter of interest. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, with the advantage of being simpler to implement (in many cases).
The details can be found in The Elements of Statistical Learning by Hastie and Tibshirani.
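A minimal numerical sketch of this correspondence (the values `x_obs_1` and `tau` are illustrative, not from the original): for a single observation the flat-prior posterior \(N(x,1)\) and the parametric bootstrap \(N(\hat\theta,1)\) are literally the same distribution, and a large but finite \(\tau\) already makes the posterior mean essentially equal to \(x\).

```r
# Sketch: posterior under N(0, tau) prior vs parametric bootstrap, x ~ N(theta, 1).
# x_obs_1 and tau are illustrative values, not from the original document.
x_obs_1 <- 1.3                       # single observation; also the MLE of theta
tau <- 1e6                           # nearly flat prior
post_mean <- x_obs_1 / (1 + 1 / tau) # posterior mean, -> x_obs_1 as tau -> Inf
post_var  <- 1 / (1 + 1 / tau)       # posterior variance, -> 1
set.seed(1)
theta_post <- rnorm(1e5, post_mean, sqrt(post_var)) # posterior draws
x_boot     <- rnorm(1e5, x_obs_1, 1)                # parametric bootstrap draws
c(mean(theta_post), mean(x_boot))    # both close to 1.3
```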
Let us compare the methods on another problem in order to appreciate the similarity of the procedures:
Suppose \(x_1,...,x_n \sim N(0, \sigma^2)\); that is, the data come from a distribution with mean zero and unknown variance.
In points 2.1 and 2.2 we want to do inference on the parameter \(\sigma^2\).
Write the log-likelihood function and compute the maximum likelihood estimator of \(\sigma^2\).
Suppose we observe the data x (in the data folder). What is your estimate of the variance?
Answer: We know that \(\{X_{i}\}_{i}^{n} \sim N(0, \sigma^{2})\), where \(\sigma^{2}\) is unknown. The log-likelihood function is: \[\ell(\sigma^{2}; \underline{x}) = - \frac{n}{2}\log(\sigma^{2}) - \frac{1}{2 \sigma^{2}}\sum_{i}^{n}x_{i}^{2}-\frac{n}{2}\log(2\pi).\] Setting the derivative with respect to \(\sigma^{2}\) to zero, \[\frac{\partial \ell}{\partial \sigma^{2}} = -\frac{n}{2\sigma^{2}} + \frac{1}{2\sigma^{4}}\sum_{i}^{n}x_{i}^{2} = 0,\] gives the maximum likelihood estimate of the variance: \[\hat{\sigma}^{2} = \frac{1}{n}\sum_{i}^{n}x_{i}^{2}.\] Note that \(\hat{\sigma}^{2}\) is unbiased, since: \[\mathbb{E}[\hat{\sigma}^{2}] = \frac{1}{n}\sum_{i}^{n}\mathbb{E}[x_{i}^{2}] = \frac{1}{n}\sum_{i}^{n}\mathbb{V}[x_{i}] = \frac{1}{n}\sum_{i}^{n}\sigma^{2} = \sigma^{2}.\]
load("data/x.RData")
# rename x to x_obs
x_obs <- x
# delete x
rm(x)
#### 2.1.1 log-likelihood of a normal with known mean
create_normal_loglike <- function(x){
# x is a vector of observed values
function(sigma2){
# sigma2 is the variance; the additive constant -n/2*log(2*pi) is dropped
-length(x) / 2 * log(sigma2) - sum(x^2) / (2 * sigma2)
}
}
# get log likelihood
loglike_xobs <- create_normal_loglike(x_obs)
# get max like by optimization
sigma2_maxlike <- optim(
par=1,
fn=loglike_xobs,
method="Brent",
control=list(fnscale=-1),
lower=0,
upper=1000
)$par
# print
cat(str_glue("max like sigma2 (by optimization): {round(sigma2_maxlike, 4)}"))
## max like sigma2 (by optimization): 131.291
cat("\n")
cat(str_glue("max like sigma2 (by formula): {round(1/length(x_obs) * sum(x_obs^2), 4)}"))
## max like sigma2 (by formula): 131.291
Approximate the standard error of the estimate using the parametric bootstrap, and make a histogram of the bootstrap replications.
simulate_sigma2 <- function(sigma2, n_obs, n_boot, seed=8){
# sigma2 is the point estimate used to simulate
# n_boot is the number of bootstrap samples
set.seed(seed)
# each replication recomputes the MLE sum(x^2)/n on a simulated sample
replicate(
n=n_boot,
expr=mean(rnorm(n_obs, mean=0, sd=sqrt(sigma2))^2)
)
}
# simulate bootstrap
sigma2_boot <- simulate_sigma2(
sigma2=sigma2_maxlike,
n_obs=length(x_obs),
n_boot=10000,
seed=8
)
# get standard error of the bootstrap
se_sigma2_boot <- sd(sigma2_boot)
# plot bootstrap and add vertical line in claim
qplot(sigma2_boot, geom="histogram", bins=30) +
geom_vline(xintercept=sigma2_maxlike, color="red") +
labs(title="Parametric bootstrap", subtitle = "Replications of sigma2", x="sigma2", y="count") +
# annotate the standard error
annotate(
geom="text",
x=200,
y=0.001,
label=str_glue("se = {round(se_sigma2_boot, 4)}"),
color="gray20"
) +
theme_bw()
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
We continue with the problem of doing inference on \(\sigma^2\). Start by specifying an Inverse Gamma prior, justify your choice of the parameters of the prior distribution, and plot its density function.
Answer: We define the model \(x_{i} \mid \sigma^{2} \sim N(0, \sigma^{2})\) with prior \(\sigma^{2} \sim \text{GI}(\alpha, \beta)\); equivalently, the precision \(\tau = 1/\sigma^{2}\) has a \(\text{Gamma}(\alpha, \beta)\) prior.
Since the posterior distribution is \(p(\sigma^{2} | \underline{x}) \propto p(\underline{x} | \sigma^{2}) p(\sigma^{2})\), we have:
The posterior for \(\tau\) is: \[ p(\tau | \underline{x}) \propto \tau^{(n/2 + \alpha) - 1} \exp\left(-\left(\beta + \frac{1}{2}\sum_{i}^{n}x_{i}^{2}\right) \tau\right). \] Hence \(\tau | \underline{x} \sim \text{Gamma}\left(n/2 + \alpha, \beta + \frac{1}{2}\sum_{i}^{n}x_{i}^{2}\right)\) and \(\sigma^{2} | \underline{x} \sim \text{GI}\left(n/2 + \alpha, \beta + \frac{1}{2}\sum_{i}^{n}x_{i}^{2}\right)\).
Using the data from before, we observe the following:
# params
ALPHA <- 1.5 # we are not so sure about the prior
beta <- ALPHA * sigma2_maxlike # we use the point estimate of the variance as the scale parameter
dinv_gamma <- function(sigma2, alpha, beta){
# density of the Inverse Gamma distribution
beta^alpha / gamma(alpha) * sigma2^(-alpha - 1) * exp(-beta / sigma2)
}
sigmas2 <- seq(1, 1000, 1)
qplot(sigmas2, dinv_gamma(sigmas2, alpha=ALPHA, beta=beta), geom="line") +
labs(title="Prior distribution", subtitle = "Inverse Gamma", x="sigma2", y="density") +
annotate(
geom="text",
x=900,
y=0.001,
label=str_glue("alpha = {ALPHA}\nbeta = {round(beta, 4)}"),
color="gray20"
) +
theme_bw()
Make a histogram of simulations from the posterior distribution and compute the standard error of the distribution.
simulate_sigma2_posterior <- function(x, alpha, beta, seed=8, n_sim=1000){
# x is the observed data
# alpha is the shape
# beta is the scale
# seed is the seed for the random number generator
# update alpha and beta
alpha_posterior <- alpha + length(x) / 2
beta_posterior <- beta + sum(x^2) / 2
# simulate from posterior
set.seed(seed)
sigma2_posterior <- 1 / rgamma(n=n_sim, shape=alpha_posterior, rate=beta_posterior)
return(sigma2_posterior)
}
# simulate
sigma2_posterior <- simulate_sigma2_posterior(
x=x_obs,
alpha=ALPHA,
beta=beta,
seed=8,
n_sim=10000
)
# se
se_sigma2_posterior <- sd(sigma2_posterior)
# graph posterior and add vertical line in claim
qplot(sigma2_posterior, geom="density", bins=20, kernel = 'epanechnikov') +
labs(title="Posterior distribution", subtitle = "Inverse Gamma", x="sigma2", y="density") +
xlim(0, 1000) +
# annotate the standard error
annotate(
geom="text",
x=900,
y=0.001,
label=str_glue("se = {round(se_sigma2_posterior, 4)}"),
color="gray20"
) +
theme_bw()
## Warning in geom_density(bins = 20, kernel = "epanechnikov"): Ignoring unknown
## parameters: `bins`
Conclusions: We see that something was learned about the variance of the data, since the posterior shrank.
How do your results compare with those of the parametric bootstrap?
# compare distributions
df_compar <- tibble(
sigma2_boot=sigma2_boot,
sigma2_posterior=sigma2_posterior
) |>
pivot_longer(
cols=c(sigma2_boot, sigma2_posterior),
names_to="method",
values_to="sigma2"
) |>
mutate(
method=fct_recode(method, "boot"="sigma2_boot", "bayes"="sigma2_posterior")
)
# plot densities
ggplot(df_compar, aes(x=sigma2, fill=method)) +
geom_density(alpha=0.5) +
ggtitle("Density comparison") +
theme_minimal()
# plot ecdf
ggplot(df_compar, aes(x=sigma2, color=method)) +
stat_ecdf() +
ggtitle("Distribution comparison") +
theme_minimal()
Conclusions: The two distributions are very similar, although the posterior is shifted slightly to the right, almost as if by a constant. The similarity between the parametric bootstrap distribution and the posterior is expected: the prior is only weakly informative relative to the data, so the posterior essentially reflects the likelihood, which is the bootstrap–Bayes correspondence described at the beginning of section 2.
Suppose now that we want to do inference on the parameter \(\tau=\log(\sigma)\). What is the maximum likelihood estimator?
Answer: Since the maximum likelihood estimator is invariant under transformations of the parameter (by the invariance theorem), the maximum likelihood estimator of \(\tau\) is \(\hat{\tau} = \log(\hat{\sigma})\).
t_mle <- log(sqrt(sigma2_maxlike))
cat("The maximum likelihood estimator of tau is", round(t_mle, 4))
## The maximum likelihood estimator of tau is 2.4387
simulate_t <- function(sigma, n_obs, n_boot, seed=8){
# simulate bootstrap
set.seed(seed)
t_mle_boot <- replicate(
n=n_boot,
expr=log(sd(rnorm(n_obs, mean=0, sd=sigma)))
)
return(t_mle_boot)
}
# simulate t
t_sim <- simulate_t(
sigma=sqrt(sigma2_maxlike),
n_obs=length(x_obs),
n_boot=10000,
seed=8
)
# get 95% confidence interval
alpha_ci <- 0.05
t_freq_ci <- quantile(t_sim, probs=c(alpha_ci/2, 1-alpha_ci/2))
# graph t
qplot(t_sim, geom="histogram", bins=30) +
labs(title="Bootstrap distribution", subtitle = "Transformation of sigma2", x="t = log(sigma)", y="count") +
# add vertical line in claim
geom_vline(xintercept=t_mle, col="red") +
# add vertical line in confidence interval
geom_vline(xintercept=t_freq_ci[1], col="red", lty=2) +
geom_vline(xintercept=t_freq_ci[2], col="red", lty=2) +
xlim(2, 3) +
theme_bw()
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
simulate_t_posterior <- function(x, alpha, beta, seed=8, n_sim=1000){
# x is the observed data
# alpha is the shape
# beta is the scale
# seed is the seed for the random number generator
# update alpha and beta
alpha_posterior <- alpha + length(x) / 2
beta_posterior <- beta + sum(x^2) / 2
# simulate from posterior
set.seed(seed)
sigma2_posterior <- 1 / rgamma(n=n_sim, shape=alpha_posterior, rate=beta_posterior)
t_posterior <- log(sqrt(sigma2_posterior))
return(t_posterior)
}
# simulate
t_posterior <- simulate_t_posterior(
x=x_obs,
alpha=ALPHA,
beta=beta,
seed=8,
n_sim=10000
)
# get 95% confidence interval
alpha_ci <- 0.05
t_bayes_ci <- quantile(t_posterior, probs=c(alpha_ci/2, 1-alpha_ci/2))
# plot posterior and add vertical line in claim
qplot(t_posterior, geom="histogram", bins=30) +
labs(title="Posterior distribution", subtitle = "Transformation of sigma2", x="t = log(sigma)", y="count") +
geom_vline(xintercept=t_mle, col="red") +
# add vertical line in confidence interval
geom_vline(xintercept=t_bayes_ci[1], col="red", lty=2) +
geom_vline(xintercept=t_bayes_ci[2], col="red", lty=2) +
xlim(2, 3) +
theme_bw()
## Warning: Removed 2 rows containing missing values (`geom_bar()`).
df_compar <- tibble(
t_boot=t_sim,
t_posterior=t_posterior
) |>
pivot_longer(
cols=c(t_boot, t_posterior),
names_to="method",
values_to="t"
) |>
mutate(
method=fct_recode(method, "boot"="t_boot", "bayes"="t_posterior")
)
# plot densities
ggplot(df_compar, aes(x=t, fill=method)) +
geom_density(alpha=0.5) +
ggtitle("Density comparison") +
theme_minimal()
# plot ecdf
ggplot(df_compar, aes(x=t, color=method)) +
stat_ecdf() +
ggtitle("Comparison of cumulative distribution functions") +
theme_minimal()
Conclusions: Comparing the bootstrap and Bayesian distributions, we see that they are similar, for the same reason as in the previous part.
The data pew_research_center_june_elect_wknd_data.dta contain information from surveys conducted during the 2008 US presidential campaign.
# read an dta file
df_polls <- foreign::read.dta("data/pew_research_center_june_elect_wknd_data.dta")
glimpse(df_polls)
## Rows: 31,201
## Columns: 70
## $ survey <fct> june08voter, aug08relig, aug08relig, aug08relig, june08vot…
## $ rid <int> 1720, 668, 50, 50533, 30091, 60, 20956, 749, 785, 1979, 18…
## $ date <int> 62708, 80208, 73108, 80208, 62108, 72708, 62608, 80208, 62…
## $ sample <fct> landline, landline, landline, cell, 18-29 oversample (land…
## $ phoneuse <fct> "dual-all or almost all calls on home phone", "dual-all or…
## $ zipcode <chr> "01007", "01010", "01013", "01013", "01020", "01020", "010…
## $ msa <int> 44140, 44140, 44140, 44140, 44140, 44140, 44140, 44140, 44…
## $ usr <chr> "S", "2", "2", "", "S", "S", "U", "2", "S", "S", "2", "", …
## $ form <fct> form 2, form 1, form 1, form 2, form 1, form 2, form 2, fo…
## $ thoughtpres <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ regist <fct> registered, registered, registered, registered, registered…
## $ regicert <fct> absolutely certain, absolutely certain, absolutely certain…
## $ party <fct> democrat, republican, independent, independent, democrat, …
## $ partyln <fct> NA, NA, lean democrat, other/dk, NA, NA, lean democrat, NA…
## $ sex <fct> male, female, male, male, female, female, male, male, male…
## $ age <int> 58, 35, 59, 32, 23, 55, 40, 59, 60, 41, 54, 27, 56, 64, 50…
## $ educ <fct> "post-graduate", "college graduate", "some college", "some…
## $ hisp <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, yes, …
## $ race <int> 1, 1, 1, 9, 3, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 4…
## $ marital <fct> divorced, married, married, living with a partner, never m…
## $ parent <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ relig <fct> roman catholic, no religion/atheist, roman catholic, prote…
## $ relig2 <fct> roman catholic, nothing in particular, roman catholic, chr…
## $ born <int> 2, NA, 2, 2, NA, 2, 2, 2, 2, NA, 2, NA, 2, 1, 2, 1, 2, 2, …
## $ attend <fct> a few times a year, never, once a week, once or twice a mo…
## $ income <fct> "$50,000-$74,999", "$100,000-$149,999", "dk/refused", "$30…
## $ ownrent <fct> own, NA, NA, NA, rent, own, rent, NA, rent, own, NA, NA, o…
## $ ideo <fct> liberal, moderate, conservative, liberal, moderate, modera…
## $ employ <fct> NA, NA, NA, NA, NA, full-time, NA, NA, NA, full-time, NA, …
## $ labor <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ weight <dbl> 1.326923, 0.822000, 0.493000, 0.492000, 2.000000, 1.800000…
## $ density <int> 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ attempt <int> 9, NA, NA, NA, 3, 7, 7, NA, 1, 8, NA, NA, 9, 1, 1, 5, NA, …
## $ fcall <int> 80624, NA, NA, NA, 80621, 80723, 80621, NA, 80622, 80725, …
## $ thought <fct> quite a lot, quite a lot, quite a lot, quite a lot, quite …
## $ heat2a <fct> dem, dk, dem, dem, dem, rep, dem, rep, dem, dem, rep, NA, …
## $ heat2b <fct> NA, dk, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, re…
## $ intsex <fct> female, NA, NA, NA, male, female, female, NA, female, fema…
## $ intrace <int> 2, NA, NA, NA, 1, 1, 1, NA, 1, 2, NA, NA, 2, 1, NA, 2, NA,…
## $ area <int> 413, NA, NA, NA, 413, 413, 413, NA, 413, 413, NA, NA, 413,…
## $ niicamp <fct> very closely, NA, NA, NA, very closely, NA, not at all clo…
## $ heat2c <fct> strongly, NA, only moderately, strongly, strongly, only mo…
## $ chancer <fct> decided not to vote for republican, dk/refused, chance mig…
## $ chanced <fct> NA, dk/refused, NA, NA, NA, chance might vote for democrat…
## $ planto1 <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, NA,…
## $ planto2 <fct> absolutely certain, NA, NA, NA, absolutely certain, NA, ab…
## $ cheata <fct> democrat, NA, NA, NA, democrat, NA, democrat, NA, democrat…
## $ cheatb <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ precinct <fct> NA, yes, yes, yes, NA, yes, NA, yes, NA, yes, yes, NA, NA,…
## $ oftvote <fct> NA, nearly always, always, always, NA, always, NA, always,…
## $ scale10 <int> NA, 10, 10, 10, NA, 10, NA, 10, NA, 10, 10, NA, NA, NA, 10…
## $ pvote08 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ inthisp <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, no, no…
## $ where <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ heat4 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ pvote04 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ heat4a <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ heat4b <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ heat4c <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ fips <int> 15, 13, 13, 13, 13, 13, 15, 13, 13, 15, 15, 15, 15, 13, 13…
## $ state <fct> massachusetts, massachusetts, massachusetts, massachusetts…
## $ cregion <fct> east, east, east, east, east, east, east, east, east, east…
## $ partysum <fct> democrat/lean democrat, republican/lean republican, democr…
## $ relign <int> 5, 10, 5, 4, 9, 5, 5, 5, 5, 10, 2, 9, 5, 6, 2, 5, 5, 5, 2,…
## $ heat2 <fct> dem/lean dem, other-dk, dem/lean dem, dem/lean dem, dem/le…
## $ cheat <fct> dem/lean dem, NA, NA, NA, dem/lean dem, NA, dem/lean dem, …
## $ age2 <fct> 50-64, 30-49, 50-64, 30-49, 18-29, 50-64, 30-49, 50-64, 50…
## $ educ2 <fct> college graduate, college graduate, some college, some col…
## $ income2 <fct> "$50,000 to $74,999", "$75,000+", NA, "$30,000 to $49,999"…
## $ party4 <fct> democrat, republican, independent, independent, democrat, …
# read election file
df_elections <- read_csv("data/2008ElectionResult.csv")
## Rows: 51 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (6): vote_Obama, vote_Obama_pct, vote_McCain, vote_McCain_pct, electoral...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(df_elections)
## Rows: 51
## Columns: 7
## $ state <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califo…
## $ vote_Obama <dbl> 811764, 105650, 948648, 418049, 7245731, 1216793, 9…
## $ vote_Obama_pct <dbl> 38.8, 37.7, 45.0, 38.8, 60.9, 53.5, 60.5, 61.9, 92.…
## $ vote_McCain <dbl> 1264879, 168844, 1132560, 632672, 4434146, 1020135,…
## $ vote_McCain_pct <dbl> 60.4, 60.2, 53.8, 58.8, 37.3, 44.9, 38.3, 37.0, 6.5…
## $ electoral_vote_dem <dbl> NA, NA, NA, NA, 55, 9, 7, 3, 3, 27, NA, 4, NA, 21, …
## $ electoral_vote_rep <dbl> 9, 3, 10, 6, NA, NA, NA, NA, NA, NA, 15, NA, 4, NA,…
# get tables
# prop as very liberals
table_polls <- df_polls |>
mutate(
is_very_liberal = (ideo == "very liberal")
) |>
group_by(state) |>
summarise(
total_vliberal = sum(is_very_liberal, na.rm=TRUE),
prop_vliberal = mean(is_very_liberal, na.rm=TRUE),
n_polls = sum(!is.na(is_very_liberal))
) |>
arrange(state) |>
filter(
!(state %in% c("alaska", "hawaii", "washington dc")) # remove these states
)
# prop of votes for obama
table_obama <- df_elections |>
mutate(state = tolower(state)) |>
inner_join(table_polls, by=c("state")) |>
select(state, prop_vliberal, vote_Obama_pct)
table_polls |>
mutate(state_abr = str_sub(state, 1, 2)) |>
ggplot(aes(x=n_polls, y=prop_vliberal)) +
geom_smooth(se=FALSE) +
ggrepel::geom_text_repel(aes(label=state), size=3, max.overlaps = 1000) +
geom_point() +
# ggrepel, add lines to the points
theme_minimal() +
labs(title="Surveys by state", subtitle="Maximum likelihood estimates", x="Number of survey responses", y="Proportion very liberal") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
table_obama |>
mutate(vote_Obama_pct = vote_Obama_pct / 100) |>
ggplot(aes(x=vote_Obama_pct, y=prop_vliberal)) +
geom_smooth(se=FALSE) +
geom_point() +
ggrepel::geom_text_repel(aes(label=state), size=3, max.overlaps = 1000) +
theme_minimal() +
labs(title="Votes by state", subtitle="Maximum likelihood estimates", x="Percentage of votes for Obama", y="Proportion very liberal") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1))
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Estimate the same percentage (very liberal) using Bayesian inference, in particular the conjugate beta-binomial family. You should estimate the proportion independently for each state, but use the same prior for all of them: \(Beta(8,160)\).
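The conjugate update used below is standard: if \(p \sim \text{Beta}(A, B)\) and \(y\) successes are observed in \(n\) trials, then
\[p \mid y \sim \text{Beta}(A + y,\; B + n - y), \qquad \mathbb{E}[p \mid y] = \frac{A + y}{A + B + n},\]
so with \(A=8\), \(B=160\) each state's sample proportion is shrunk toward the prior mean \(8/168 \approx 0.048\), with more shrinkage in states with few responses.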
# params
A <- 8
B <- 160
# simulate posterior of a proportion
simulate_prop_posterior <- function(n_success, n_obs, n_sim=1000, A=2, B=2, seed=8){
# update A and B
A_posterior <- A + n_success
B_posterior <- B + (n_obs - n_success)
# simulate from posterior
set.seed(seed)
prop_posterior <- rbeta(n=n_sim, shape1=A_posterior, shape2=B_posterior)
return(prop_posterior)
}
# simulate posterior for each state
df_obama_posterior <- table_polls |>
# simulate and save them in a list and then expand
group_by(state) |>
mutate(
prop_posterior = list(simulate_prop_posterior(
n_success=total_vliberal,
n_obs=n_polls,
n_sim=10000,
A=A,
B=B,
seed=8
))
) |>
ungroup() |>
select(state, prop_posterior) |>
unnest(prop_posterior)
# get mean, std and 95% confidence interval for each state
table_stats_obama <- df_obama_posterior |>
group_by(state) |>
summarise(
mean = mean(prop_posterior),
std = sd(prop_posterior),
ci_lower = quantile(prop_posterior, probs=0.025),
ci_upper = quantile(prop_posterior, probs=0.975)
) |>
arrange(state)
# knitr
knitr::kable(
table_stats_obama,
caption="Summary statistics of the posterior distribution of the proportion of very liberals, by state"
)
| state | mean | std | ci_lower | ci_upper |
|---|---|---|---|---|
| alabama | 0.0484021 | 0.0076987 | 0.0343917 | 0.0643815 |
| arizona | 0.0513489 | 0.0083704 | 0.0361438 | 0.0687152 |
| arkansas | 0.0317529 | 0.0080595 | 0.0178346 | 0.0493492 |
| california | 0.0628283 | 0.0044829 | 0.0543398 | 0.0718376 |
| colorado | 0.0552859 | 0.0091265 | 0.0387084 | 0.0742274 |
| connecticut | 0.0393398 | 0.0082564 | 0.0247371 | 0.0568471 |
| delaware | 0.0527789 | 0.0132501 | 0.0297962 | 0.0814911 |
| florida | 0.0387416 | 0.0044725 | 0.0303804 | 0.0478867 |
| georgia | 0.0422702 | 0.0058854 | 0.0314348 | 0.0544068 |
| idaho | 0.0716168 | 0.0147347 | 0.0453631 | 0.1026859 |
| illinois | 0.0662199 | 0.0069976 | 0.0530199 | 0.0805053 |
| indiana | 0.0375571 | 0.0060889 | 0.0264986 | 0.0502046 |
| iowa | 0.0410938 | 0.0080710 | 0.0267168 | 0.0581234 |
| kansas | 0.0608405 | 0.0107986 | 0.0413562 | 0.0833683 |
| kentucky | 0.0366359 | 0.0072124 | 0.0238018 | 0.0518440 |
| louisiana | 0.0408209 | 0.0071903 | 0.0278580 | 0.0558650 |
| maine | 0.0527725 | 0.0124857 | 0.0309593 | 0.0794817 |
| maryland | 0.0423721 | 0.0073548 | 0.0290939 | 0.0577209 |
| massachusetts | 0.0636201 | 0.0084200 | 0.0480131 | 0.0809082 |
| michigan | 0.0432194 | 0.0060144 | 0.0321447 | 0.0556203 |
| minnesota | 0.0456648 | 0.0070878 | 0.0327499 | 0.0604211 |
| mississippi | 0.0559129 | 0.0111152 | 0.0361022 | 0.0793512 |
| missouri | 0.0373018 | 0.0062050 | 0.0260574 | 0.0501895 |
| montana | 0.0388667 | 0.0120240 | 0.0186221 | 0.0655355 |
| nebraska | 0.0372037 | 0.0097384 | 0.0204631 | 0.0585129 |
| nevada | 0.0435741 | 0.0106678 | 0.0250370 | 0.0666169 |
| new hampshire | 0.0274289 | 0.0089982 | 0.0124776 | 0.0475321 |
| new jersey | 0.0409834 | 0.0062288 | 0.0296117 | 0.0539269 |
| new mexico | 0.0417450 | 0.0102255 | 0.0239694 | 0.0638258 |
| new york | 0.0610827 | 0.0056383 | 0.0504259 | 0.0725358 |
| north carolina | 0.0430132 | 0.0058689 | 0.0321860 | 0.0551055 |
| north dakota | 0.0346874 | 0.0107530 | 0.0166031 | 0.0585628 |
| ohio | 0.0449746 | 0.0052843 | 0.0351029 | 0.0557931 |
| oklahoma | 0.0592200 | 0.0097604 | 0.0415010 | 0.0794466 |
| oregon | 0.0759603 | 0.0105972 | 0.0563774 | 0.0977510 |
| pennsylvania | 0.0358052 | 0.0044871 | 0.0274679 | 0.0450013 |
| rhode island | 0.0469373 | 0.0122178 | 0.0258731 | 0.0735810 |
| south carolina | 0.0385033 | 0.0077286 | 0.0247526 | 0.0548244 |
| south dakota | 0.0497662 | 0.0134297 | 0.0267135 | 0.0791928 |
| tennessee | 0.0443006 | 0.0068813 | 0.0317652 | 0.0586303 |
| texas | 0.0397633 | 0.0043345 | 0.0316248 | 0.0486081 |
| utah | 0.0515421 | 0.0104945 | 0.0328803 | 0.0737152 |
| vermont | 0.0567206 | 0.0138055 | 0.0327195 | 0.0863737 |
| virginia | 0.0418327 | 0.0062054 | 0.0304803 | 0.0546893 |
| washington | 0.0703216 | 0.0089536 | 0.0536616 | 0.0886316 |
| west virginia | 0.0275674 | 0.0078464 | 0.0142804 | 0.0447683 |
| wisconsin | 0.0418473 | 0.0066798 | 0.0297038 | 0.0556776 |
| wyoming | 0.0456541 | 0.0148214 | 0.0209098 | 0.0785390 |
# plot the posterior distribution for each state
df_obama_posterior |>
  ggplot(aes(x = prop_posterior)) +
  geom_histogram(bins = 50) +
  geom_vline(
    data = table_stats_obama,
    aes(xintercept = mean),
    color = "red"
  ) +
  geom_vline(
    data = table_stats_obama,
    aes(xintercept = ci_lower),
    color = "red",
    linetype = "dashed"
  ) +
  geom_vline(
    data = table_stats_obama,
    aes(xintercept = ci_upper),
    color = "red",
    linetype = "dashed"
  ) +
  facet_wrap(~state, ncol = 5) +
  theme_minimal() +
  labs(
    title = "Posterior distribution of the proportion of very liberal voters",
    subtitle = "By state",
    x = "Proportion very liberal",
    y = "Frequency"
  )
IDAHO
#install.packages('cmdstanr', repos = c('https://mc-stan.org/r-packages/', getOption('repos')))
library(cmdstanr)
## This is cmdstanr version 0.6.1
## - CmdStanR documentation and vignettes: mc-stan.org/cmdstanr
## - CmdStan path: /Users/sofiagerard/.cmdstan/cmdstan-2.33.1
## - CmdStan version: 2.33.1
library(posterior)
## This is posterior version 1.5.0
##
## Attaching package: 'posterior'
## The following objects are masked from 'package:stats':
##
## mad, sd, var
## The following objects are masked from 'package:base':
##
## %in%, match
# Install CmdStan using cmdstanr (only needed once)
cmdstanr::install_cmdstan()
## The C++ toolchain required for CmdStan is setup properly!
## * Latest CmdStan release is v2.33.1
## * Installing CmdStan v2.33.1 in /Users/sofiagerard/.cmdstan/cmdstan-2.33.1
## * Downloading cmdstan-2.33.1.tar.gz from GitHub...
## Warning: An installation already exists at
## /Users/sofiagerard/.cmdstan/cmdstan-2.33.1. Please remove or rename the
## installation folder or set overwrite=TRUE.
# Compile the Stan model
model <- cmdstan_model("stan/beta_binomial_model.stan")
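The contents of `stan/beta_binomial_model.stan` are not shown in this document. A plausible sketch, consistent with the `theta` and `prior_theta` parameters that appear in `fit_stan$summary()`, might look like this; note that the Beta(8, 160) prior is an assumption inferred from the reported `prior_theta` summary (mean ≈ 0.048, sd ≈ 0.016), not the confirmed model:

```stan
// Hypothetical reconstruction of stan/beta_binomial_model.stan.
// The Beta(8, 160) prior is an assumption inferred from the reported
// prior_theta summary; the actual .stan file may differ.
data {
  int<lower=0> n;                // number of poll respondents
  int<lower=0, upper=n> y;       // number of "very liberal" responses
}
parameters {
  real<lower=0, upper=1> theta;  // proportion of very liberal voters
}
model {
  theta ~ beta(8, 160);          // prior
  y ~ binomial(n, theta);        // likelihood
}
generated quantities {
  real prior_theta = beta_rng(8, 160);  // prior draws, for prior/posterior comparison
}
```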
# get the data for Idaho
data_idaho <- table_polls |>
  filter(state == 'idaho')
data_idaho
## # A tibble: 1 × 4
## state total_vliberal prop_vliberal n_polls
## <fct> <int> <dbl> <int>
## 1 idaho 14 0.101 139
# fit model
fit_stan <- model$sample(
  data = list(
    n = data_idaho$n_polls,
    y = data_idaho$total_vliberal
  ),
  chains = 8,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 10000,
  refresh = 500,
  seed = 8
)
## Running MCMC with 8 chains, at most 4 in parallel...
##
## Chain 1 Iteration: 1 / 11000 [ 0%] (Warmup)
## ... (per-chain iteration progress omitted) ...
## Chain 8 finished in 0.2 seconds.
##
## All 8 chains finished successfully.
## Mean chain execution time: 0.3 seconds.
## Total execution time: 0.9 seconds.
# diagnose
fit_stan$cmdstan_diagnose()
## Processing csv files: (8 chain output csv files in the session temp directory)
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
fit_stan$summary()
## # A tibble: 3 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -79.7 -79.4 0.714 0.319 -81.1 -79.2 1.00 36959.
## 2 theta 0.0716 0.0706 0.0148 0.0147 0.0491 0.0973 1.00 28153.
## 3 prior_theta 0.0477 0.0459 0.0164 0.0160 0.0242 0.0775 1.00 79905.
## # ℹ 1 more variable: ess_tail <dbl>
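Because the Beta prior is conjugate to the binomial likelihood, the posterior is also available in closed form, which gives a quick sanity check of the MCMC summary. A minimal sketch, assuming a Beta(8, 160) prior (inferred from the `prior_theta` row of the summary, not confirmed by the `.stan` file):

```r
# Closed-form conjugate check for Idaho
# (the Beta(8, 160) prior is an assumption inferred from the prior_theta summary)
a_prior <- 8
b_prior <- 160
y <- 14    # very-liberal responses in Idaho
n <- 139   # Idaho poll respondents

a_post <- a_prior + y       # 22
b_post <- b_prior + n - y   # 285
post_mean <- a_post / (a_post + b_post)
post_sd <- sqrt(a_post * b_post / ((a_post + b_post)^2 * (a_post + b_post + 1)))

round(post_mean, 4)  # ~0.0717, matching the MCMC mean of 0.0716
round(post_sd, 4)    # ~0.0147, matching the MCMC sd of 0.0148
```

The agreement between the analytic posterior and the sampled one is further evidence that the chains mixed well.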
fit_stan$draws(c("theta", "prior_theta")) |>
  posterior::as_draws_df() |>
  ggplot(aes(.iteration, theta)) +
  geom_line() +
  facet_wrap(~.chain, ncol = 1)
fit_stan$draws(c("theta", "prior_theta")) |>
  posterior::as_draws_df() |>
  mutate(.chain = factor(.chain)) |>
  ggplot(aes(.iteration, theta, color = .chain)) +
  geom_line(alpha = 0.5)
Conclusions: For Idaho, the reported R-hat values are 1.000182, 1.000260, and 1.000099, all very close to the ideal value of 1, which indicates excellent convergence of the MCMC chains and suggests that the parameter estimates are stable and reliable.
The ess_bulk and ess_tail values are also notably high for every quantity: 36959.35 and 40861.68 for lp__, 28152.81 and 34713.97 for theta, and 79905.39 and 78459.93 for prior_theta. These effective sample sizes mean that both the center and the tails of the posterior are estimated from a large number of effectively independent draws, so the inferences for Idaho rest on a solid statistical footing.
VIRGINIA
# get the data for Virginia
data_virginia <- table_polls |>
  filter(state == 'virginia')
data_virginia
## # A tibble: 1 × 4
## state total_vliberal prop_vliberal n_polls
## <fct> <int> <dbl> <int>
## 1 virginia 36 0.0407 884
# fit model
fit_stan <- model$sample(
  data = list(
    n = data_virginia$n_polls,
    y = data_virginia$total_vliberal
  ),
  chains = 8,
  parallel_chains = 4,
  iter_warmup = 1000,
  iter_sampling = 10000,
  refresh = 500,
  seed = 8
)
## Running MCMC with 8 chains, at most 4 in parallel...
##
## Chain 1 Iteration: 1 / 11000 [ 0%] (Warmup)
## ... (per-chain iteration progress omitted) ...
## Chain 8 finished in 0.3 seconds.
##
## All 8 chains finished successfully.
## Mean chain execution time: 0.3 seconds.
## Total execution time: 0.8 seconds.
# diagnose
fit_stan$cmdstan_diagnose()
## Processing csv files: (8 chain output csv files in the session temp directory)
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
fit_stan$summary()
## # A tibble: 3 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -1.83e+2 -1.83e+2 0.717 0.316 -1.85e+2 -1.83e+2 1.00 35571.
## 2 theta 4.18e-2 4.14e-2 0.00619 0.00616 3.21e-2 5.24e-2 1.00 27839.
## 3 prior_theta 4.76e-2 4.57e-2 0.0164 0.0160 2.41e-2 7.74e-2 1.00 78779.
## # ℹ 1 more variable: ess_tail <dbl>
fit_stan$draws(c("theta", "prior_theta")) |>
  posterior::as_draws_df() |>
  ggplot(aes(.iteration, theta)) +
  geom_line() +
  facet_wrap(~.chain, ncol = 1)
fit_stan$draws(c("theta", "prior_theta")) |>
  posterior::as_draws_df() |>
  mutate(.chain = factor(.chain)) |>
  ggplot(aes(.iteration, theta, color = .chain)) +
  geom_line(alpha = 0.5)
Conclusions: For Virginia, the R-hat values are 1.000110, 1.000185, and 1.000127 for lp__, theta, and prior_theta respectively. These values are extremely close to 1, which is ideal, indicating that the MCMC chains converged very well and that the parameter estimates are reliable.
The ess_bulk and ess_tail values are again exceptionally high in every case: 35570.52 and 39194.73 for lp__, 27839.46 and 34102.24 for theta, and 78778.66 and 78495.77 for prior_theta. Such large effective sample sizes mean that many effectively independent draws contribute to the estimates of both the bulk and the tails of the posterior, so the results are robust and the inferences for Virginia are highly reliable.
# join the posterior summaries onto the poll data
table_obama_posterior <- table_polls |>
  inner_join(table_stats_obama, by = c("state")) |>
  mutate(prop_vliberal = mean)
# plot: Bayesian vs frequentist estimates against number of polls
table_obama_posterior |>
  mutate(method = "bayesian") |>
  select(state, total_vliberal, prop_vliberal, n_polls, method) |>
  add_row(
    table_polls |>
      mutate(method = "frequentist") |>
      select(state, total_vliberal, prop_vliberal, n_polls, method)
  ) |>
  mutate(state_abr = str_sub(state, 1, 2)) |>
  ggplot(aes(x = n_polls, y = prop_vliberal, color = method)) +
  geom_point() +
  geom_text(aes(label = state_abr), hjust = 0, vjust = 0) +
  theme_minimal()
Conclusions: The estimates of the proportion of very liberal voters are noticeably more regularized under the Bayesian method than under the frequentist one. The red points representing the Bayesian estimates show less dispersion across states, indicating lower variability in the estimates regardless of the number of polls conducted. This reflects the Bayesian method's ability to incorporate prior information and manage uncertainty, pulling extreme estimates from poorly sampled states toward the prior and yielding more coherent and reliable estimates of the proportion of very liberal voters.
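The pooling effect can be made concrete with the conjugate update. Under the same assumed Beta(8, 160) prior (inferred from the `prior_theta` summary, an assumption), a state with few respondents such as Idaho is pulled strongly toward the prior mean of about 0.048, while a well-sampled state such as Virginia barely moves:

```r
# Shrinkage of the posterior mean toward the prior mean
# (the Beta(8, 160) prior is an assumption inferred from the prior_theta summary)
a <- 8
b <- 160
prior_mean <- a / (a + b)  # ~0.0476

states <- data.frame(
  state = c("idaho", "virginia"),
  y = c(14, 36),    # very-liberal counts from table_polls
  n = c(139, 884)   # poll respondents
)
states$raw <- states$y / states$n                        # frequentist estimate
states$posterior <- (a + states$y) / (a + b + states$n)  # conjugate posterior mean
states$pull <- abs(states$raw - states$posterior)        # how far the estimate moved

states
# idaho:    raw ~0.101 -> posterior ~0.072 (large pull toward the prior)
# virginia: raw ~0.041 -> posterior ~0.042 (almost no pull)
```

The amount of pull shrinks as `n` grows, which is exactly the reduced dispersion visible among the Bayesian points in the plot.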
# scatter of prop_vliberal (posterior point estimate) vs vote_Obama_pct
df_elections |>
  mutate(state = tolower(state)) |>
  inner_join(table_obama_posterior, by = c("state")) |>
  select(state, prop_vliberal, vote_Obama_pct) |>
  mutate(state_abr = str_sub(state, 1, 2)) |>
  ggplot(aes(x = vote_Obama_pct, y = prop_vliberal)) +
  geom_point() +
  geom_text(aes(label = state), hjust = 0, vjust = 0) +
  theme_minimal()
Conclusions: The plot compares the proportion of very liberal voters with the percentage of votes for Obama, with each point representing a state. The data show a positive correlation: states with a higher percentage of votes for Obama also tend to have a higher proportion of very liberal voters. This matches the expected electoral behavior, where voters with liberal leanings are more likely to support a Democratic candidate. The arrangement of the states in the plot suggests that those with liberal political leanings gave Obama strong support, while states with fewer very liberal voters tended to give him smaller vote shares.
Note: In problems like this one, where we estimate a parameter for each group (states, in our case), we can instead use a hierarchical model, in which the distribution of the \(\theta_j\) is not fixed by the prior but is modeled with an additional level whose parameters are estimated from the data and are themselves given prior distributions:
\[y_j \mid \theta_j \sim \text{Binomial}(n_j, \theta_j)\]
\[\theta_j \sim \text{Beta}(\alpha, \beta)\]
\[\alpha \sim g(a_0), \quad \beta \sim f(b_0)\]
where \(g(a_0)\) and \(f(b_0)\) are priors chosen with expert knowledge.
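A sketch of that hierarchical model in Stan might look as follows; the gamma hyperpriors stand in for \(g(a_0)\) and \(f(b_0)\) and are placeholders, not a recommendation:

```stan
// Hypothetical hierarchical beta-binomial model; the gamma hyperpriors
// are placeholders for g(a_0) and f(b_0).
data {
  int<lower=1> J;                    // number of states
  array[J] int<lower=0> n;           // respondents per state
  array[J] int<lower=0> y;           // very-liberal counts per state
}
parameters {
  vector<lower=0, upper=1>[J] theta; // per-state proportions
  real<lower=0> alpha;               // population-level shape parameters
  real<lower=0> beta_pop;
}
model {
  alpha ~ gamma(2, 0.1);             // g(a_0), placeholder hyperprior
  beta_pop ~ gamma(2, 0.1);          // f(b_0), placeholder hyperprior
  theta ~ beta(alpha, beta_pop);     // second level, estimated from the data
  y ~ binomial(n, theta);            // likelihood
}
```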
Román: Thank you, Teresa, for the course. I enjoyed it a lot and learned a great deal; it has been very useful in my work. Best of luck with the quick count in the coming six-year term! Sofia: Tere, thank you for such an important and interesting course. I enjoyed it immensely. Good luck next semester!